Conversation

@my-vegetable-has-exploded
Contributor

@my-vegetable-has-exploded my-vegetable-has-exploded commented Dec 30, 2025

Log persistence can fail if the collector container crashes or restarts. To ensure reliability, the collector must resume processing any logs left in the prev-logs directory upon recovery.

In this PR, we enhanced the WatchPrevLogsLoops() function to perform an initial scan of the prev-logs directory at startup before entering the watch loop. This ensures that any session or node folders left from previous runs (e.g., after a container restart) are correctly processed and uploaded.

Why are these changes needed?

Related issue number

Closes #4281

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: my-vegetable-has-exploded <[email protected]>
@my-vegetable-has-exploded my-vegetable-has-exploded marked this pull request as ready for review December 30, 2025 11:30
}
}

// isFileAlreadyPersisted checks if a file has already been persisted to persist-complete-logs
Contributor Author


isFileAlreadyPersisted detects files already moved to persist-complete-logs and avoids duplicate uploads.

Member

@Future-Outlier Future-Outlier left a comment


Hi, thank you for this PR!
Do you mind providing a reproduction script like this so I can test it in my env more easily?

https://github.com/ray-project/kuberay/blob/master/historyserver/docs/set_up_collector.md#test-the-collector-on-the-kind-cluster

@JiangJiaWei1103
Contributor

JiangJiaWei1103 commented Jan 1, 2026

Hi @my-vegetable-has-exploded, thanks for your contribution.

Would you mind modifying the PR title and description to clarify that this feature is actually for container restart, not pod restart? I believe this works only for cases in which the collector sidecar fails and recovers. For now, the data is permanently erased if the head pod restarts since we use an emptyDir volume.

If I have misunderstood anything, please correct me. Thanks!

Comment on lines 52 to +53
r.prevLogsDir = "/tmp/ray/prev-logs"
r.persistCompleteLogsDir = "/tmp/ray/persist-complete-logs"
Contributor


Could we move these to constants in historyserver/pkg/utils/utils.go, similar to KunWuLuan@9b2ca52?

Contributor


Hi @justinyeh1995,

Thanks for pointing this out. I’m currently working on it (handling all constants at once) and will open a PR soon. I’m not sure whether it should be included here or handled in a separate PR.

Contributor

@justinyeh1995 justinyeh1995 Jan 1, 2026


No problem. I wasn't aware of that when I wrote the comment. Is the PR related to any issue, though? I think I need to catch up quite a bit.

Contributor


You can refer to this issue. Thanks a lot!

Contributor


Thanks for the heads up!

@my-vegetable-has-exploded my-vegetable-has-exploded changed the title feat(historyserver):re-push prev-logs on pod restart feat(history server):re-push prev-logs on container restart Jan 4, 2026
Member

@Future-Outlier Future-Outlier left a comment


plz also add e2e test in this PR.

@my-vegetable-has-exploded
Contributor Author

plz also add e2e test in this PR.

Got it! I'll add it tonight.

Member

@Future-Outlier Future-Outlier left a comment


cc @lorriexingfang to review


nit: consider using filepath functions to handle file paths.

Contributor Author


Thanks for the review. There are similar issues elsewhere in this file. Maybe we can file a new ticket to handle them?

Member


Agreed, let's handle it as a follow-up.

Member


Hi, @my-vegetable-has-exploded,
do you mind creating an issue so that I can assign it to others?

Contributor Author


no problem.


Member

@Future-Outlier Future-Outlier left a comment


cursor review

@Future-Outlier
Member

bugbot run

@Future-Outlier
Member

cursor review

Signed-off-by: my-vegetable-has-exploded <[email protected]>
@Future-Outlier
Member

cc @JiangJiaWei1103 @CheyuWu @seanlaii to test this PR locally before you approve this, thank you!

Comment on lines 505 to 507
// 2. Inject "leftover" logs into prev-logs via the ray-head container while collector is down.
// Note: We only inject logs, not node_events, because the collector's prev-logs processing
// currently only handles the logs directory. node_events are handled by the EventServer separately.
Contributor


Just curious. How can we ensure that the injection is completed during the collector downtime?

Contributor Author

@my-vegetable-has-exploded my-vegetable-has-exploded Jan 11, 2026


Because Kubernetes restarts the collector immediately after kill 1, I can't absolutely guarantee the process stays down until the injection finishes. But the injection is short-lived, so the current approach works well in my local environment.
If you have a better way to handle this, I’d be glad to implement it.

Contributor


I agree that the injection completes quickly while the collector takes some time to become ready, and this approach works in my local environment as well. At the moment, I don’t have a more robust solution in mind.

cc @seanlaii @CheyuWu Do you have any suggestions? Thanks so much!

@CheyuWu CheyuWu self-requested a review January 10, 2026 01:49
return
}

// Check if this directory has already been processed by checking in persist-complete-logs
Contributor

@JiangJiaWei1103 JiangJiaWei1103 Jan 10, 2026


No changes are needed here. This note is just to clarify that we still need to check for leftover log files, since the presence of log directories under the persist-complete-logs directory alone does not guarantee that all logs have been fully persisted.

Contributor Author


Thanks for the review! If the collector crashes halfway through a node directory, the persist-complete-logs entry already exists but some logs remain in prev-logs. Keeping this check would cause the collector to skip the entire directory upon restart, leading to data loss. The current file-level check isFileAlreadyPersisted handles this "partial success" scenario correctly without duplicate uploads.

Contributor Author

@my-vegetable-has-exploded my-vegetable-has-exploded Jan 10, 2026


If you replace it with the following code, you can trigger this case using `cd historyserver && go test -v ./pkg/collector/logcollector/runtime/logcollector/ -run TestScanAndProcess`. @JiangJiaWei1103

```go
	completeDir := filepath.Join(r.persistCompleteLogsDir, sessionID, nodeID, "logs")
	if _, err := os.Stat(completeDir); err == nil {
		logrus.Infof("Session %s node %s logs already processed, skipping", sessionID, nodeID)
		return
	}
```

And the e2e test also covers this case.

Contributor


Wow, thanks for the detailed explanation! I originally just wanted to leave a brief note for maintainers to get why we removed the code snippet here. All tests pass in my local env. Thanks!

}

// Walk through the logs directory and process all files
err := filepath.WalkDir(logsDir, func(path string, info fs.DirEntry, err error) error {
Contributor Author


When the collector restarts, WatchPrevLogsLoops calls processPrevLogsDir, so we can process leftover log files in prev-logs/{sessionID}/{nodeID}/logs/ directories.

Co-authored-by: 江家瑋 <[email protected]>
Signed-off-by: yi wang <[email protected]>
Comment on lines +144 to +166

## Troubleshooting

### "too many open files" error

If you encounter `level=fatal msg="Create fsnotify NewWatcher error too many open files"` in the collector logs,
it is likely due to the inotify limits on the Kubernetes nodes.

To fix this, increase the limits on the **host nodes** (not inside the container):

```bash
# Apply changes immediately
sudo sysctl -w fs.inotify.max_user_instances=8192
sudo sysctl -w fs.inotify.max_user_watches=524288
```

To make these changes persistent across reboots, use the following lines:

```bash
echo "fs.inotify.max_user_instances=8192" | sudo tee -a /etc/sysctl.conf
echo "fs.inotify.max_user_watches=524288" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
Member


very cool, never heard this, I learned something

Member


Hi, @my-vegetable-has-exploded,
do you mind creating an issue so that I can assign it to others?

Member

@Future-Outlier Future-Outlier left a comment


LGTM, just revise it, thank you!
cc @rueian to merge

@Future-Outlier Future-Outlier moved this from collector to can be merged in @Future-Outlier's kuberay project Jan 14, 2026
@Future-Outlier Future-Outlier moved this from can be merged to Done in @Future-Outlier's kuberay project Jan 14, 2026
@Future-Outlier Future-Outlier moved this from Done to can be merged in @Future-Outlier's kuberay project Jan 14, 2026
@Future-Outlier Future-Outlier changed the title feat(history server):re-push prev-logs on container restart [historyserver][collector] Add file-level idempotency check for prev-logs processing on container restart Jan 14, 2026


Development

Successfully merging this pull request may close these issues.

[history server][collector] Implement strategy for handling duplicate logs on pod restart

6 participants